Journal of Biomedical Informatics

Elsevier BV

All preprints, ranked by how well they match the content profile of the Journal of Biomedical Informatics, based on 45 papers previously published in the journal. The average preprint has a 0.07% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Mining for Health: A Comparison of Word Embedding Methods for Analysis of EHRs Data

Getzen, E.; Ruan, Y.; Ungar, L.; Long, Q.

2022-03-08 health informatics 10.1101/2022.03.05.22271961 medRxiv
Top 0.1%
48.9%

Electronic health records (EHRs), routinely collected as part of healthcare delivery, offer great promise for advancing precision health. At the same time, they present significant analytical challenges. In EHRs, data for individual patients are collected at irregular time intervals and with varying frequencies; they include both structured and unstructured data. Advanced statistical and machine learning methods have been developed to tackle these challenges, for example, for predicting diagnoses earlier and more accurately. One powerful tool for extracting useful information from EHR data is word embedding algorithms, which represent words as vectors of real numbers that capture the words' semantic and syntactic similarities. Learning embeddings can be viewed as automated feature engineering, producing features that can be used for predictive modeling of medical events. Methods such as Word2Vec, BERT, FastText, ELMo, and GloVe have been developed for word embedding, but there has been little work on re-purposing these algorithms for the analysis of structured medical data. Our work seeks to fill this important gap. We extended word embedding methods to embed (structured) medical codes from a patient's entire medical history, and used the resultant embeddings to build prediction models for diseases. We assessed the performance of multiple embedding methods in terms of predictive accuracy and computation time using the Medical Information Mart for Intensive Care (MIMIC) database. We found that the Word2Vec, FastText, and GloVe algorithms yield comparable models, while more recent contextual embeddings provide marginal further improvement. Our results provide insights and guidance to practitioners regarding the use of word embedding methods for the analysis of EHR data.
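The repurposing idea above can be made concrete with a minimal sketch: treat a patient's ordered medical codes as a "sentence" and pool per-code vectors into a feature vector for a downstream classifier. The ICD-9 codes and 2-d vectors below are toy illustrations, not trained MIMIC embeddings.

```python
# Sketch: pool per-code embeddings from a patient's code history into
# one feature vector. The 2-d vectors are made-up values, standing in
# for vectors a method like Word2Vec would learn from code sequences.
code_vectors = {
    "ICD9_4019": [0.9, 0.1],   # hypertension (illustrative)
    "ICD9_2724": [0.8, 0.3],   # hyperlipidemia (illustrative)
    "ICD9_25000": [0.2, 0.7],  # diabetes (illustrative)
}

def patient_features(history):
    """Average the embedding of every known code in a patient's history."""
    vecs = [code_vectors[c] for c in history if c in code_vectors]
    if not vecs:
        return [0.0, 0.0]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(2)]

feats = patient_features(["ICD9_4019", "ICD9_25000"])
print(feats)  # mean of the two code vectors
```

The resulting fixed-length vector can then feed any standard prediction model, which is the "automated feature engineering" view the abstract describes.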

2
Large Language Models for Social Determinants of Health Information Extraction from Clinical Notes - A Generalizable Approach across Institutions

Keloth, V. K.; Selek, S.; Chen, Q.; Gilman, C.; Fu, S.; Dang, Y.; Chen, X.; Hu, X.; Zhou, Y.; He, H.; Fan, J. W.; Wang, K.; Brandt, C.; Tao, C.; Liu, H.; Xu, H.

2024-05-22 health informatics 10.1101/2024.05.21.24307726 medRxiv
Top 0.1%
43.1%

The consistent and persuasive evidence illustrating the influence of social determinants on health has prompted a growing realization throughout the health care sector that enhancing health and health equity will likely depend, at least to some extent, on addressing detrimental social determinants. However, detailed social determinants of health (SDoH) information is often buried within clinical narrative text in electronic health records (EHRs), necessitating natural language processing (NLP) methods to automatically extract these details. Most current NLP efforts for SDoH extraction have been limited: they investigate only a few types of SDoH elements, derive data from a single institution, and focus on specific patient cohorts or note types, with little attention to generalizability. This study aims to address these issues by creating cross-institutional corpora spanning different note types and healthcare systems, and developing and evaluating the generalizability of classification models, including novel large language models (LLMs), for detecting SDoH factors from diverse types of notes from four institutions: Harris County Psychiatric Center, University of Texas Physician Practice, Beth Israel Deaconess Medical Center, and Mayo Clinic. Four corpora of deidentified clinical notes were annotated with 21 SDoH factors at two levels: level 1 with SDoH factor types only and level 2 with SDoH factors along with associated values. Three traditional classification algorithms (XGBoost, TextCNN, Sentence BERT) and an instruction-tuned LLM-based approach (LLaMA) were developed to identify multiple SDoH factors. Substantial variation was noted in SDoH documentation practices and label distributions based on patient cohorts, note types, and hospitals. The LLM achieved top performance with micro-averaged F1 scores over 0.9 on level 1 annotated corpora and an F1 over 0.84 on level 2 annotated corpora.
While models performed well when trained and tested on individual datasets, cross-dataset generalization highlighted remaining obstacles. To foster collaboration, access to partial annotated corpora and models trained by merging all annotated datasets will be made available on the PhysioNet repository.

3
Using Large Language Models to Annotate Complex Cases of SDoH in Longitudinal Clinical Records

Ralevski, A.; Taiyab, A.; Nossal, M.; Mico, L.; Piekos, S.; Hadlock, J. J.

2024-04-27 public and global health 10.1101/2024.04.25.24306380 medRxiv
Top 0.1%
42.1%

Social Determinants of Health (SDoH) are an important part of the exposome and are known to have a large impact on variation in health outcomes. In particular, housing stability is known to be intricately linked to a patient's health status, and pregnant women experiencing housing instability (HI) are known to have worse health outcomes. Most SDoH information is stored in electronic health records (EHRs) as free text (unstructured) clinical notes, which traditionally required natural language processing (NLP) for automatic identification of relevant text or keywords. A patient's housing status can be ambiguous or subjective, and can change from note to note or within the same note, making it difficult to use existing NLP solutions. New developments in NLP allow researchers to prompt large language models (LLMs) to perform complex, subjective annotation tasks that require reasoning that previously could only be attempted by human annotators. For example, LLMs such as GPT (Generative Pre-trained Transformer) enable researchers to analyze complex, unstructured data using simple prompts. We used a secure platform within a large healthcare system to compare the ability of GPT-3.5 and GPT-4 to identify instances of both current and past housing instability, as well as general housing status, from 25,217 notes from 795 pregnant women. Results from these LLMs were compared with results from manual annotation, a named entity recognition (NER) model, and regular expressions (RegEx). We developed a chain-of-thought prompt requiring evidence and justification for each note from the LLMs, to help maximize the chances of finding relevant text related to HI while minimizing hallucinations and false positives.
Compared with GPT-3.5 and the NER model, GPT-4 performed best, with much higher recall (0.924) than human annotators (0.702) in identifying patients experiencing current or past housing instability, although its precision was lower (0.850 vs. 0.971 for human annotators). In most cases, the evidence output by GPT-4 was similar or identical to that of human annotators, and there was no evidence of hallucinations in any of the outputs from GPT-4. Most cases where the annotators and GPT-4 differed were ambiguous or subjective, such as "living in an apartment with too many people". We also looked at GPT-4 performance on de-identified versions of the same notes and found that precision improved slightly (0.936 original, 0.939 de-identified), while recall dropped (0.781 original, 0.704 de-identified). This work demonstrates that, while manual annotation is likely to yield slightly more accurate results overall, LLMs provide a scalable, cost-effective alternative with the advantage of greater recall. At the same time, further evaluation is needed to address the risk of missed cases and bias in the initial selection of housing-related notes. Additionally, while it was possible to reduce confabulation, signs of unusual justifications remained. Given these factors, together with changes in both LLMs and charting over time, this approach is not yet appropriate for use as a fully automated process. However, these results demonstrate the potential for using LLMs for computer-assisted annotation with human review, reducing cost and increasing recall. More efficient methods for obtaining structured SDoH data can help accelerate inclusion of exposome variables in biomedical research, and support healthcare systems in identifying patients who could benefit from proactive outreach.
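A minimal sketch of the kind of evidence-and-justification prompt the abstract describes; the wording, field names, and label set here are hypothetical illustrations, not the study's actual prompt.

```python
def build_prompt(note_text):
    """Build an illustrative chain-of-thought annotation prompt that forces
    the model to quote evidence and justify its label, which is the
    mechanism the study uses to suppress hallucinated positives."""
    return (
        "You are annotating a clinical note for housing instability.\n"
        "Think step by step, then answer in exactly this format:\n"
        "EVIDENCE: <verbatim quote from the note, or 'none'>\n"
        "JUSTIFICATION: <one sentence>\n"
        "LABEL: <CURRENT | PAST | STABLE | UNKNOWN>\n\n"
        f"Note:\n{note_text}"
    )

prompt = build_prompt("Pt reports staying in a shelter since May.")
print(prompt)
```

Requiring a verbatim EVIDENCE field makes each positive label checkable against the source note, which is what enables the human-review workflow the authors recommend.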

4
Filling the gaps: leveraging large language models for temporal harmonization of clinical text across multiple medical visits for clinical prediction

Choi, I.; Long, Q.; Getzen, E.

2024-05-07 intensive care and critical care medicine 10.1101/2024.05.06.24306959 medRxiv
Top 0.1%
41.2%

Electronic health records offer great promise for early disease detection, treatment evaluation, information discovery, and other important facets of precision health. Clinical notes, in particular, may contain nuanced information about a patient's condition, treatment plans, and history that structured data may not capture. As a result, and with advancements in natural language processing, clinical notes have been increasingly used in supervised prediction models. To predict long-term outcomes such as chronic disease and mortality, it is often advantageous to leverage data occurring at multiple time points in a patient's history. However, these data are often collected at irregular time intervals and varying frequencies, thus posing an analytical challenge. Here, we propose the use of large language models (LLMs) for robust temporal harmonization of clinical notes across multiple visits. We compare multiple state-of-the-art LLMs in their ability to generate useful information during time gaps, and evaluate performance in supervised deep learning models for clinical prediction.

5
Stroke Risk Prediction from Medical Survey Data: AI-Driven Risk Analysis with Insightful Feature Importance using Explainable AI (XAI)

Akter, S. B.; Akter, S.; Pias, T. S.

2023-11-17 public and global health 10.1101/2023.11.17.23298646 medRxiv
Top 0.1%
38.1%

Prioritizing dataset dependability, model performance, and interpretability is essential for improving stroke risk prediction from medical surveys using AI in healthcare. These collective efforts are required to enhance the field of stroke risk assessment and demonstrate the transformational potential of AI in healthcare. This study leverages the CDC's recently published 2022 BRFSS dataset to explore AI-based stroke risk prediction, and makes several notable contributions. First, the dataset's dependability is improved through a unique RF-based imputation technique that overcomes the challenges of missing data. To identify the most promising models, six different AI models are evaluated: DT, RF, GNB, RusBoost, AdaBoost, and CNN. The study combines top-performing models such as GNB, RF, and RusBoost using fusion approaches such as soft voting, hard voting, and stacking to demonstrate the combined prediction performance. The stacking model demonstrated superior performance, achieving an F1 score of 88%. The work also employs Explainable AI (XAI) approaches to highlight the subtle contributions of important dataset features, improving model interpretability. The comprehensive approach to stroke risk prediction employed in this study enhanced dataset reliability, model performance, and interpretability, demonstrating AI's fundamental impact in healthcare.

6
Identifying Sepsis Subphenotypes via Time-Aware Multi-Modal Auto-Encoder

Yin, C.; Liu, R.; Zhang, D.; Zhang, P.

2020-07-29 intensive care and critical care medicine 10.1101/2020.07.26.20162214 medRxiv
Top 0.1%
37.2%

Sepsis is a heterogeneous clinical syndrome that is the leading cause of mortality in hospital intensive care units (ICUs). Identification of sepsis subphenotypes may allow for more precise treatments and lead to more targeted clinical interventions. Recently, sepsis subtyping on electronic health records (EHRs) has attracted interest from healthcare researchers. However, most sepsis subtyping studies ignore the temporality of EHR data and suffer from missing values. In this paper, we propose a new sepsis subtyping framework to address these two issues. Our subtyping framework consists of a novel Time-Aware Multi-modal auto-Encoder (TAME) model, which introduces a time-aware attention mechanism and incorporates multi-modal inputs (e.g., demographics, diagnoses, medications, lab tests and vital signs) to impute missing values; a dynamic time warping (DTW) method to measure patients' temporal similarity based on the imputed EHR data; and a weighted k-means algorithm to cluster patients. Comprehensive experiments on real-world datasets show TAME outperforms the baselines on imputation accuracy. After analyzing TAME-imputed EHR data, we identify four novel subphenotypes of sepsis patients, paving the way for improved personalization of sepsis management.
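The DTW step can be illustrated with a minimal pure-Python implementation using absolute difference as the local cost; real EHR series would be multivariate and imputed by TAME first, so this is only a sketch of the distance itself.

```python
def dtw(a, b):
    """Classic dynamic time warping distance between two 1-d series.
    DTW aligns series of different lengths or sampling rates, which is
    why it suits irregularly observed patient trajectories."""
    INF = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = cost of the best alignment of a[:i] and b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# The same rising trend sampled at different rates still aligns exactly.
print(dtw([1, 2, 3], [1, 1, 2, 2, 3]))  # 0.0
print(dtw([1, 2, 3], [3, 2, 1]))        # positive: opposite trends
```

A pairwise matrix of such distances is what a weighted k-means (or k-medoids) step can then cluster, as in the framework above.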

7
SmokeBERT: A BERT-based Model for Quantitative Smoking History Extraction from Clinical Narratives to Improve Lung Cancer Screening

Xue, Y.; Zhu, Y.; Zhuang, L.; Oh, Y.; Taira, R.; Aberle, D. R.; Prosper, A. E.; Hsu, W.; Lin, Y.

2025-06-20 health informatics 10.1101/2025.06.18.25329870 medRxiv
Top 0.1%
33.9%

Tobacco use is a critical risk factor for diseases such as cancer and cardiovascular disorders. While electronic health records can capture categorical smoking statuses accurately, granular quantitative details, such as pack-years and years since quitting, are often embedded in clinical narratives. This information is crucial for assessing disease risk and determining eligibility for lung cancer screening (LCS). Existing natural language processing (NLP) tools excel at identifying smoking status but struggle to extract detailed quantitative data. To address this, we developed SmokeBERT, a fine-tuned BERT-based model optimized for extracting detailed smoking histories. Evaluations against a state-of-the-art rule-based NLP model demonstrated its superior performance on F1 scores (0.97 vs. 0.88 on the hold-out test set) and identification of LCS-eligible patients (e.g., 98% vs. 60% for ≥20 pack-years). Future work includes creating a multilingual, language-agnostic version of SmokeBERT by incorporating datasets in multiple languages, exploring ensemble methods, and testing on larger datasets.
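For contrast with SmokeBERT, a rule-based baseline of the sort it is compared against might look like the regex sketch below; the pattern is illustrative and nowhere near covering real clinical phrasing.

```python
import re

# Illustrative rule-based extraction of quantitative pack-year mentions.
# Real rule-based systems use many such patterns plus negation handling.
PACK_YEARS = re.compile(r"(\d+(?:\.\d+)?)\s*pack[- ]?years?", re.I)

def extract_pack_years(text):
    """Return the first pack-year quantity mentioned, or None."""
    m = PACK_YEARS.search(text)
    return float(m.group(1)) if m else None

print(extract_pack_years("Former smoker, 30 pack-year history, quit 2010."))  # 30.0
print(extract_pack_years("Denies tobacco use."))                              # None
```

The brittleness of such patterns on varied narrative phrasing is exactly the gap a fine-tuned model targets.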

8
Contextual Embeddings from Clinical Notes Improves Prediction of Sepsis

Amrollahi, F.; Shashikumar, S.; Razmi, F.; Nemati, S.

2021-03-03 intensive care and critical care medicine 10.1101/2021.03.02.21252779 medRxiv
Top 0.1%
33.5%

Sepsis, a life-threatening organ dysfunction, is a clinical syndrome triggered by acute infection that affects over 1 million Americans every year. Untreated sepsis can progress to septic shock and organ failure, making sepsis one of the leading causes of morbidity and mortality in hospitals. Early detection of sepsis and timely antibiotic administration are known to save lives. In this work, we design a sepsis prediction algorithm based on data from electronic health records (EHR) using a deep learning approach. While most existing EHR-based sepsis prediction models utilize structured data including vitals, labs, and clinical information, we show that incorporation of features based on clinical texts, using a pre-trained neural language representation model, allows for incorporation of unstructured data without an explicit need for ontology-based named-entity recognition and classification. The proposed model is trained on a large critical care database of over 40,000 patients, including 2805 septic patients, and is compared against competing baseline models. In comparison to a baseline model based on structured data alone, incorporation of clinical texts improved AUC from 0.81 to 0.84. Our findings indicate that incorporation of clinical text features via a pre-trained language representation model can improve early prediction of sepsis and reduce false alarms.

9
Using Large Language Models for sentiment analysis of health-related social media data: empirical evaluation and practical tips

He, L.; Omranian, S.; McRoy, S.; Zheng, K.

2024-03-20 health informatics 10.1101/2024.03.19.24304544 medRxiv
Top 0.1%
33.5%

Health-related social media data generated by patients and the public provide valuable insights into patient experiences and opinions toward health issues such as vaccination and medical treatments. Using Natural Language Processing (NLP) methods to analyze such data, however, often requires high-quality annotations that are difficult to obtain. The recent emergence of Large Language Models (LLMs) such as the Generative Pre-trained Transformers (GPTs) has shown promising performance on a variety of NLP tasks in the health domain with little to no annotated data. However, their potential in analyzing health-related social media data remains underexplored. In this paper, we report empirical evaluations of LLMs (GPT-3.5-Turbo, FLAN-T5, and BERT-based models) on a common NLP task for health-related social media data: sentiment analysis for identifying opinions toward health issues. We explored how different prompting and fine-tuning strategies affect the performance of LLMs on social media datasets across diverse health topics, including healthcare reform, vaccination, mask wearing, and healthcare service quality. We found that LLMs outperformed VADER, a widely used off-the-shelf sentiment analysis tool, but were still far from producing accurate sentiment labels. However, their performance can be improved by data-specific prompts with information about the context, task, and targets. The highest-performing LLMs were BERT-based models fine-tuned on aggregated data. We provide practical tips for researchers to use LLMs on health-related social media data for optimal outcomes, and discuss future work needed to continue to improve the performance of LLMs for analyzing health-related social media data with minimal annotations.

10
Zero-shot Large Language Models for Long Clinical Text Summarization with Temporal Reasoning

Kruse, M.; Hu, S.; Derby, N.; Wu, Y.; Stonbraker, S.; Yao, B.; Wang, D.; Goldberg, E.; Gao, Y.

2025-07-23 health informatics 10.1101/2025.07.21.25331947 medRxiv
Top 0.1%
33.4%

Recent advances in large language models (LLMs) have shown potential in clinical text summarization, but their ability to handle long patient trajectories with multi-modal data spread across time remains underexplored. This study systematically evaluates several state-of-the-art open-source LLMs, their Retrieval Augmented Generation (RAG) variants, and chain-of-thought (CoT) prompting on long-context clinical summarization and prediction. We examine their ability to synthesize structured and unstructured Electronic Health Records (EHR) data while reasoning over temporal coherence, by re-engineering existing tasks, including discharge summarization and diagnosis prediction, from two publicly available EHR datasets. Our results indicate that long context windows improve input integration but do not consistently enhance clinical reasoning, and LLMs still struggle with temporal progression and rare disease prediction. While RAG reduces hallucination in some cases, it does not fully address these limitations. Our work fills the gap in long clinical text summarization, establishing a foundation for evaluating LLMs with multi-modal data and temporal reasoning.

11
Advancements in Multilingual Biomedical Natural Language Processing: exploring Large Language Models for Named Entity Recognition and Linking

Mazzucato, S.; Seinen, T. M.; Moccia, S.; Micera, S.; Bandini, A.; van Mulligen, E. M.

2026-01-23 health informatics 10.64898/2026.01.22.26344605 medRxiv
Top 0.1%
33.2%

Objective: Named Entity Recognition (NER) and Biomedical Entity Linking (BEL) are essential for transforming unstructured Electronic Health Records (EHRs) into structured information. However, tools for these tasks are limited in non-English biomedical texts such as Dutch and Italian. This study investigates the use of prompt-based learning with Large Language Models (LLMs) to perform multilingual NER and BEL using minimal domain-specific data, while addressing annotation preservation during corpus translation. Methods: An English-annotated corpus from the ShARe/CLEF dataset was translated into Dutch and Italian using a strategy that embeds annotations directly into the text prior to translation and retrieves them afterwards. GPT-4o was applied in zero-shot and few-shot settings to extract biomedical entities, which were then mapped to Unified Medical Language System Concept Unique Identifiers using contextual word embeddings. Performance was evaluated with precision, recall, and F1-score, and compared with gold-standard clinician annotations. Results: The multilingual NER pipeline achieved strong performance, with an overall F1-score of 0.98 across languages. BEL experiments showed reliable entity normalization, with an overall accuracy of 0.91 and a mean reciprocal rank of 0.95. The combined NER and BEL pipeline achieved 0.90, supporting the utility of LLMs in standardizing biomedical concepts across languages. Conclusion: Prompt-based LLMs can effectively perform NER and BEL in languages with fewer annotated resources, even with limited annotated training data. The proposed annotation-preserving translation method, combined with generative and discriminative LLM capabilities, provides a scalable approach to multilingual clinical information extraction. These findings highlight the potential for broader adoption of LLM-based natural language processing systems to support multilingual healthcare data harmonization.
Graphical Abstract: [figure omitted]
Highlights:
- This study shows the feasibility of using prompt-based learning with large language models (LLMs) to perform multilingual named entity recognition (NER) and biomedical entity linking (BEL) in Dutch and Italian, two languages with fewer annotated resources.
- An annotation-preserving translation strategy was proposed to adapt the ShARe/CLEF eHealth corpus, enabling consistent evaluation across English, Dutch, and Italian without loss of gold-standard annotations.
- The multilingual NER pipeline achieved strong overall performance (F1-score: 0.89), while BEL experiments showed reliable entity normalization (F1-score: 0.64, MRR: 0.68) to standardized clinical concepts.
- The approach highlights the potential of generative and discriminative LLM capabilities for scalable multilingual clinical information extraction, supporting broader European initiatives for cross-lingual health data harmonization.

12
Using natural language processing to study homelessness longitudinally with electronic health record data subject to irregular observations

Chapman, A. B.; Scharfstein, D. O.; Montgomery, A. E.; Byrne, T.; Suo, Y.; Effiong, A.; Velasquez, T.; Pettey, W.; Nelson, R.

2023-03-18 health informatics 10.1101/2023.03.17.23287414 medRxiv
Top 0.1%
33.2%

The Electronic Health Record (EHR) contains information about social determinants of health (SDoH) such as homelessness. Much of this information is contained in clinical notes and can be extracted using natural language processing (NLP). This data can provide valuable information for researchers and policymakers studying long-term housing outcomes for individuals with a history of homelessness. However, studying homelessness longitudinally in the EHR is challenging due to irregular observation times. In this work, we applied an NLP system to extract housing status for a cohort of patients in the US Department of Veterans Affairs (VA) over a three-year period. We then applied inverse intensity weighting to adjust for the irregularity of observations, and used generalized estimating equations to estimate the probability of unstable housing each day after entering a VA housing assistance program. Our methods generate unique insights into the long-term outcomes of individuals with a history of homelessness and demonstrate the potential for using EHR data for research and policymaking.
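The weighting idea can be sketched numerically. The intensities below are made up; in the study, weights come from a fitted observation-time model and feed a generalized estimating equation rather than the plain weighted mean shown here.

```python
# Toy sketch of inverse intensity weighting: each observed note gets
# weight 1 / (estimated observation intensity), so frequently observed
# patients do not dominate the estimate of unstable housing.
observations = [
    # (unstable_housing, estimated_observation_intensity) -- illustrative
    (1, 4.0),  # frequently observed patient, unstable
    (1, 4.0),  # same patient, another note
    (0, 0.5),  # rarely observed patient, stable
]

weights = [1.0 / lam for _, lam in observations]
weighted_mean = sum(w * y for (y, _), w in zip(observations, weights)) / sum(weights)
print(round(weighted_mean, 3))  # down-weighted estimate of instability
```

An unweighted mean here would be 2/3, driven by the heavily observed patient; the inverse-intensity weights pull the estimate toward the under-observed one, which is the bias correction the method provides.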

13
Combining multimorbidity clustering with limited demographic information enables high-precision outcome predictions

Ferreira, F. S.; Le Lannou, E.; Post, B.; Haar, S.; Kadiverlu, B.; Brett, S. J.; Faisal, A. A.

2024-05-28 health informatics 10.1101/2024.05.28.24308024 medRxiv
Top 0.1%
33.1%

Multimorbidity, the coexistence of multiple health conditions in individuals, is prevalent and increasing worldwide, proving to be a growing challenge for patients and healthcare systems. Furthermore, multimorbidity contributes to an increased risk of hospital admission or even death. In this study, we employ a principled approach that utilises longitudinal data routinely collected in electronic health records, linked to half a million people from the UK Biobank, to generate digital comorbidity fingerprints (DCFs) using a topic modelling approach, Latent Dirichlet Allocation. These comorbidity fingerprints summarise a patient's full secondary care clinical history, i.e. their comorbidities and past interventions. We identified 18 clinically relevant DCFs, which captured nuanced combinations of diseases and risk factors, e.g. grouping cardiovascular disorders with common risk factors, but also novel groupings that are not obvious and differ in both their breadth and depth from existing observational disease associations. The DCFs, combined with demographic characteristics, performed on par with or outperformed traditional models of all-cause mortality or hospital admission, showcasing the potential of data-driven strategies in healthcare forecasting. The comorbidity fingerprints, together with age and number of hospital admissions, were shown to be the most important factors in the predictions. Additionally, our DCF approach showed robust and consistent performance over time. Our findings underscore the promising role of interpretable data-driven approaches in healthcare forecasting, suggesting improved risk profiling for individual clinical decisions and targeted public health interventions.
Author summary: This study addresses the global challenge of multimorbidity, the presence of multiple health conditions in individuals, which is on the rise and poses a significant burden on patients and healthcare systems. Investigating its impact on the risk of hospitalization or mortality, we employ a sophisticated approach using longitudinal data from the UK Biobank to create digital comorbidity fingerprints (DCFs) through natural language processing methods. These DCFs, summarizing a patient's complete clinical history, reveal 18 clinically relevant patterns, including unique combinations of diseases and risk factors. When combined with patient demographic and lifestyle data, the DCF approach performs similarly to traditional models in predicting all-cause mortality or hospitalization. Notably, the DCF approach demonstrates robust and consistent performance over time, highlighting its potential for enhancing healthcare forecasting. These findings emphasize the value of interpretable data-driven strategies in healthcare, offering improved risk profiling for individual clinical decisions and targeted public health interventions with enduring reliability.

14
Extracting social determinants of health from electronic health records: development and comparison of rule-based and large language models-based methods

Wang, B.; Kabir, D.; Clark, C. R.; Choi, K. W.; Smoller, J. W.

2025-11-17 health informatics 10.1101/2025.11.15.25339520 medRxiv
Top 0.1%
33.1%

Objectives: Social determinants of health (SDoH) are critical drivers of health outcomes but are often under-documented in structured electronic health record data. This study aimed to develop and evaluate scalable methods for extracting seven SDoH domain categories and 23 subcategories from unstructured clinical notes using both rule-based and large language model (LLM)-based approaches. Methods: We constructed a gold-standard SDoH corpus comprising clinical text segments from 171 patients in the Mass General Brigham Research Patient Data Registry. A rule-based system (RBS) was developed and its performance compared with seven OpenAI GPT models (GPT-4o, 4.1, 4.1-mini, o4-mini, GPT-5, GPT-5-mini, and o3) under zero-shot and few-shot settings with multiple prompting strategies. We also implemented ensemble models combining RBS and LLM outputs via late fusion. Results: The RBS achieved the highest precision for SDoH domain categories (0.97) but substantially lower recall (0.62). GPT-based models outperformed RBS in overall F1 scores, with GPT-5 and GPT-5-mini (few-shot) achieving the best domain-level F1 of 0.88 and o4-mini achieving the highest subcategory F1 of 0.79. The RBS-GPT ensemble improved domain-level performance to 0.89 F1 with balanced precision (0.90) and recall (0.89). Model performance was consistent across demographic groups. Conclusion: State-of-the-art GPT models with advanced reasoning capabilities, including the recently released "mini" models (e.g., o4-mini and GPT-5-mini), demonstrated robust performance for SDoH extraction without fine-tuning and outperformed rule-based NLP. Integrating rule-based and LLM approaches further enhanced performance. Our results provide a scalable, cost-efficient framework for accurate identification of SDoH from clinical text, supporting downstream population health research and clinical informatics applications.
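The late-fusion step can be sketched minimally. Taking the union of per-note category predictions is one simple fusion rule (the actual ensemble may weight or vote), and the category names are illustrative.

```python
# Minimal late-fusion sketch in the spirit of the RBS+GPT ensemble:
# merge category predictions from a high-precision rule-based system
# and a higher-recall LLM. The union trades a little precision for
# recall, which matches the balanced ensemble numbers reported above.
def late_fusion(rbs_labels, llm_labels):
    """Combine two systems' SDoH category predictions for one note."""
    return sorted(set(rbs_labels) | set(llm_labels))

rbs = {"housing"}                      # rule-based output (illustrative)
llm = {"housing", "employment"}        # LLM output (illustrative)
print(late_fusion(rbs, llm))           # ['employment', 'housing']
```

Fusing at the label level like this needs no access to either model's internals, which is what makes it easy to bolt an RBS onto an off-the-shelf LLM.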

15
Can NLP Detect Loneliness in Electronic Health Records? A Proof-of-Concept Study

Park, T.; Habibi, S.; Lowers, J.; Sarker, A.; Bozkurt, S.

2026-04-11 health informatics 10.64898/2026.04.08.26350462 medRxiv
Top 0.1%
33.1%

Loneliness is clinically important but under-documented in electronic health records (EHRs), posing challenges for secondary use and computational phenotyping. This study evaluated whether natural language processing (NLP) methods can detect and classify loneliness severity from clinical notes. Patients with a loneliness survey (mild, moderate, severe) were identified, and notes within six months prior to the survey were retrieved. An expert-expanded lexicon was applied, and transformer models (RoBERTa, ClinicalBERT, Longformer) were fine-tuned for loneliness severity classification. Large language model-based summarization of social and psychiatric history was also tested as an alternative input representation. Performance was evaluated using accuracy, weighted-F1, and per-class F1. All models achieved modest accuracy (0.3 to 0.7), and struggled to identify severe loneliness, reflecting sparse and inconsistent documentation even among surveyed patients. While summarization marginally improved accuracy, gains primarily reflected mild predictions. Manual review of 100 social worker notes from severely lonely patients found explicit mentions of loneliness in only two cases, confirming that relevant documentation is exceedingly rare. These findings demonstrate that model performance is constrained by the sparse and inconsistent documentation of loneliness in EHRs, rather than by deficiencies in the modeling approach itself.

16
Scaling text de-identification using locally augmented ensembles

Murugadoss, K.; Kilamsetty, S.; Doddahonnaiah, D.; Iyer, N.; Pencina, M.; Ferranti, J.; Halamka, J.; Malin, B. A.; Ardhanari, S.

2024-06-20 health informatics 10.1101/2024.06.20.24308896 medRxiv
Top 0.1%
32.6%

The natural language text in electronic health records (EHRs), such as clinical notes, often contains information that is not captured elsewhere (e.g., degree of disease progression and responsiveness to treatment) and, thus, is invaluable for downstream clinical analysis. However, to make such data available for broader research purposes in the United States, personally identifiable information (PII) is typically removed from the EHR in accordance with the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA). Automated de-identification systems that mimic human accuracy in identifier detection can enable access, at scale, to more diverse de-identified data sets, thereby fostering robust findings in medical research to advance patient care. The best-performing of such systems employ language models that require time and effort for retraining or fine-tuning on newer datasets to achieve consistent results, as well as revalidation on older datasets. Hence, there is a need to adapt text de-identification methods to datasets across health institutions. Given the success of foundational large language models (LLMs), such as ChatGPT, in a wide array of natural language processing (NLP) tasks, they seem a natural fit for identifying PII across varied datasets. In this paper, we introduce locally augmented ensembles, which adapt an existing PII detection ensemble method trained at one health institution to others by using institution-specific dictionaries to capture location-specific PII and recover medically relevant information that was previously misclassified as PII. We augment an ensemble model created at Mayo Clinic and test it on a dataset of 15,716 clinical notes at Duke University Health System. We further compare the task-specific, fine-tuned ensemble against LLM-based prompt engineering solutions on the 2014 i2b2 and 2003 CoNLL NER datasets for prediction accuracy, speed, and cost. On the Duke notes, our approach achieves increased recall and precision of 0.996 and 0.982, respectively, compared to 0.989 and 0.979 without the augmentation. Our results indicate that LLMs may require significant prompt engineering effort to reach the levels attained by ensemble approaches. Further, given the current state of technology, they are at least 3 times slower and 5 times more expensive to operate than the ensemble approach.
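The local-augmentation idea described here can be sketched as two dictionary passes over the base ensemble's output. The allow-list and block-list entries below are illustrative assumptions, not the Mayo or Duke dictionaries used in the paper:

```python
# Sketch of institution-specific augmentation of a PII-detection ensemble:
# an allow-list recovers clinical terms misclassified as PII (e.g., eponyms
# that resemble person names), and a block-list adds location-specific
# identifiers the base model missed. Entries are hypothetical.
ALLOW_LIST = {"parkinson"}            # clinical terms that look like names
BLOCK_LIST = {"duke south clinic"}    # local facility names to redact

def augment(base_pii: set[str], note_text: str) -> set[str]:
    """Post-process the base ensemble's PII detections with local dictionaries."""
    pii = {p for p in base_pii if p.lower() not in ALLOW_LIST}
    lowered = note_text.lower()
    pii |= {term for term in BLOCK_LIST if term in lowered}
    return pii
```

This keeps the expensive base model frozen across institutions; only the cheap dictionaries change per site, which is what makes the approach scale.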

17
Comparing natural language processing representations of disease sequences for prediction in the electronic healthcare record

Beaney, T.; Jha, S.; Alaa, A.; Smith, A.; Clarke, J.; Woodcock, T.; Majeed, A.; Aylin, P.; Barahona, M.

2023-11-16 health informatics 10.1101/2023.11.16.23298640 medRxiv
Top 0.1%
32.6%

Natural language processing (NLP) is increasingly being applied to obtain unsupervised representations of electronic healthcare record (EHR) data, but their performance for the prediction of clinical endpoints remains unclear. Here we use primary care EHRs from 6,286,233 people with Multiple Long-Term Conditions in England to generate vector representations of sequences of disease development using two input strategies (212 disease categories versus 9,462 diagnostic codes) and different NLP algorithms (Latent Dirichlet Allocation, doc2vec and two transformer models designed for EHRs). We also develop a new transformer architecture, named EHR-BERT, which incorporates socio-demographic information. We then compare use of each of these representations to predict mortality, healthcare use and new disease diagnosis. We find that representations generated using disease categories perform similarly to those using diagnostic codes, suggesting models can equally manage smaller or larger vocabularies. Sequence-based algorithms perform consistently better than bag-of-words methods, with the highest performance for EHR-BERT.
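The contrast between sequence-based and bag-of-words representations can be shown with a toy example; the diagnostic codes below are illustrative, not the study's vocabulary:

```python
from collections import Counter

# Two patients with the same diseases diagnosed in different orders.
patient_a = ["E11", "I10", "N18"]  # diabetes -> hypertension -> CKD
patient_b = ["N18", "I10", "E11"]  # reverse order of onset

def bag_of_codes(sequence: list[str]) -> Counter:
    """Order-free representation: only code counts survive."""
    return Counter(sequence)

# A bag-of-words method (e.g., LDA over these counts) treats the two
# patients as identical; sequence models (doc2vec over ordered context
# windows, transformer models) can distinguish the trajectories.
```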

18
Large Language Models Struggle to Encode Medical Concepts - A Multilingual Benchmarking and Comparative Analysis

Rouhizadeh, H.; Yazdani, A.; Zhang, B.; Vicente Alvarez, D.; Hueser, M.; Vanobberghen, A.; Yang, R.; Li, I.; Walter, A.; Teodoro, D.

2025-01-15 health informatics 10.1101/2025.01.15.25320579 medRxiv
Top 0.1%
32.0%

Interoperability in health information systems is crucial for accurate data exchange across environments such as electronic health records, clinical notes, and medical research. The main challenge arises from the wide variation in biomedical concepts, their representation across different systems and languages, and the limited context, complicating data integration and standardization. Inspired by recent advances in large language models (LLMs), this study explores their potential role as biomedical knowledge engineers to (semi-)automate multilingual biomedical concept normalization, a key task for semantic interoperability of medical concepts. We developed a novel multilingual dataset comprising 59,104 unique terms mapped to 27,280 distinct biomedical concepts, designed to assess language model performance on this task across five European languages: English, French, German, Spanish, and Turkish. We then proposed a multi-stage pipeline based on a retrieve-then-rerank approach using sparse and dense retrievers, rerankers, and fusion approaches, leveraging discriminative and generative LLMs, with a predefined primary knowledge organization system. Our experiments show that the best discriminative model, e5, achieves an accuracy of 71%, surpassing the best generative model, Mistral, by 2% (p-value < 0.001). For semi-automated workflows, e5 maintained superior performance with 82% recall@10 versus Mistral's 78%. Our findings demonstrate how LLM-based approaches can advance the normalization of multilingual biomedical terms, as well as the limitations of LLMs in encoding biomedical concepts.
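The retrieve-then-rerank pattern can be sketched with toy scorers. Both scoring functions below are stand-ins (token overlap), not the sparse/dense retrievers or LLM rerankers the study evaluated:

```python
# Minimal retrieve-then-rerank sketch for concept normalization:
# a cheap retriever shortlists candidate concepts, then a (normally
# more expensive) scorer reranks the shortlist. Here both stages use
# the same toy token-overlap score purely for illustration.
def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)  # Jaccard similarity

def retrieve_then_rerank(term: str, concepts: list[str], k: int = 3) -> str:
    shortlist = sorted(concepts, key=lambda c: token_overlap(term, c), reverse=True)[:k]
    return max(shortlist, key=lambda c: token_overlap(term, c))
```

The two-stage design matters because reranking every concept in a 27,000-entry vocabulary with a heavyweight model is infeasible; the retriever bounds the reranker's workload to k candidates.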

19
Bridging the Gap in Health Literacy: Harnessing the Power of Large Language Models to Generate Plain Language Summaries from Biomedical Texts

Salazar-Lara, C.; Arias Russi, A. F.; Manrique, R.

2024-07-03 health informatics 10.1101/2024.07.02.24309847 medRxiv
Top 0.1%
31.8%

Health literacy is essential for individuals to navigate the healthcare system and make informed decisions about their health. Low health literacy levels have been associated with negative health outcomes, particularly among older populations and those financially restricted or with lower educational attainment. Plain language summaries (PLS) are an effective tool to bridge the gap in health literacy by simplifying content found in biomedical and clinical documents, in turn allowing the general audience to truly understand health-related documentation. However, translating biomedical texts to PLS is time-consuming and challenging, which is why they are rarely accessible to those who need them. We assessed the performance of Natural Language Processing (NLP) for systematizing plain language identification and Large Language Models (LLMs), Generative Pre-trained Transformer (GPT) 3.5 and GPT 4, for automating PLS generation from biomedical texts. The classification model achieved high precision (97.2%) in identifying whether a text is written in plain language. GPT 4, a state-of-the-art LLM, successfully generated PLS that were semantically equivalent to those generated by domain experts and that were rated high in accuracy, readability, completeness, and usefulness. Our findings demonstrate the value of using LLMs and NLP to translate biomedical texts into plain language summaries, and their potential to be used as a supporting tool for healthcare stakeholders to empower patients and the general audience to understand healthcare information and make informed healthcare decisions.
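A toy stand-in for the plain-language identification step can illustrate the idea. Real classifiers use much richer features, but average sentence length is one classic readability signal; the threshold below is an illustrative assumption, not the paper's model:

```python
import re

def avg_words_per_sentence(text: str) -> float:
    """Split on sentence-ending punctuation and average the word counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

def looks_plain(text: str, max_avg: float = 15.0) -> bool:
    """Crude heuristic: short sentences suggest plainer language."""
    return avg_words_per_sentence(text) <= max_avg
```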

20
Toward Digital Twins in the Intensive Care Unit: A Medication Management Case Study

Eslami, B.; Afshar, M.; Tootooni, M. S.; Miller, T.; Churpek, M. M.; Gao, Y.; Dligach, D.

2024-12-28 intensive care and critical care medicine 10.1101/2024.12.20.24319170 medRxiv
Top 0.1%
29.4%

Objective: To evaluate the efficacy of digital twins developed using a large language model (LLaMA-3), fine-tuned with Low-Rank Adapters (LoRA) on ICU physician notes, and to determine whether specialty-specific training enhances treatment recommendation accuracy compared to other ICU specialties or zero-shot baselines. Materials and Methods: Digital twins were created using LLaMA-3 fine-tuned on discharge summaries from the MIMIC-III dataset, where medications were masked to construct training and testing datasets. The medical ICU dataset (1,000 notes) was used for evaluation, and performance was assessed using BERTScore and ROUGE-L. A zero-shot baseline model, relying solely on contextual instructions without training, was also evaluated. While our approach moves toward digital twin capabilities, it does not incorporate real-time, patient-specific EHR data and can be viewed as an ICU specialty-level language model adaptation. Results: Models fine-tuned on medical ICU notes achieved the highest BERTScore (0.842), outperforming models trained on other specialties or mixed datasets. Zero-shot models showed the lowest performance, highlighting the importance of training. Discussion: The findings demonstrate that specialty-specific training significantly improves treatment recommendation accuracy in digital twins compared to generalized or zero-shot approaches. Tailoring models to specific ICU domains strengthens their clinical decision-support capabilities. Conclusion: Context-specific fine-tuning of large language models is crucial for developing effective digital twins, offering foundational insights for personalized clinical decision support.
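The medication-masking step used to build the training and testing datasets can be sketched as a simple preprocessing pass. The medication list and mask token below are illustrative assumptions:

```python
# Sketch of masking medication mentions in a discharge summary so a
# language model can be fine-tuned to predict them. A real pipeline
# would match against a full drug vocabulary (e.g., RxNorm), not this
# toy set, and handle multi-word drug names.
MEDICATIONS = {"norepinephrine", "propofol", "vancomycin"}
MASK = "[MED]"

def mask_medications(note: str) -> tuple[str, list[str]]:
    """Return the masked note and the list of masked-out target drugs."""
    masked_tokens, targets = [], []
    for tok in note.split():
        word = tok.lower().strip(".,;:")
        if word in MEDICATIONS:
            masked_tokens.append(MASK)
            targets.append(word)
        else:
            masked_tokens.append(tok)
    return " ".join(masked_tokens), targets
```

Each (masked note, targets) pair then becomes one fine-tuning example: the model sees the clinical context and learns to fill in the medications.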